AITopics | dom tree

dom tree

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

AutoFAIR : Automatic Data FAIRification via Machine Reading

Ma, Tingyan, Liu, Wei, Lu, Bin, Gan, Xiaoying, Zhu, Yunqiang, Fu, Luoyi, Zhou, Chenghu

arXiv.org Artificial Intelligence, Aug-7-2024

The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles emerge as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lacks efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automatically. First, we align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, we use Web Reader to automatically extract metadata based on language models, even in the absence of structured data webpage schemas. Subsequently, FAIR Alignment is employed to make metadata comply with FAIR principles through ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in the findability, accessibility, interoperability, and reusability of data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.

fair principle, information, metadata, (16 more...)

arXiv.org Artificial Intelligence

2408.04673

Country:

Europe (0.14)
North America (0.14)
Asia > China > Shanghai > Shanghai (0.06)

Genre: Research Report (0.50)

Industry:

Health & Medicine (0.46)
Information Technology (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path

Wang, Zilong, Shang, Jingbo

arXiv.org Artificial Intelligence, May-23-2023

The rapid growth of web pages and the increasing complexity of their structure pose a challenge for web mining models. Web mining models are required to understand semi-structured web pages, particularly when little is known about the subject or template of a new page. Current methods adapt language models to web mining by embedding the XML source code into the transformer or encoding the rendered layout with graph neural networks. However, these approaches do not take into account the relationships between text nodes within and across pages. In this paper, we propose a new approach, ReXMiner, for zero-shot relation extraction in web mining. ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree, which is a more accurate and efficient signal for key-value pair extraction within a web page. It also incorporates the popularity of each text node by counting the occurrences of the same text node across different web pages. We use contrastive learning to address the issue of sparsity in relation extraction. Extensive experiments on public benchmarks show that our method, ReXMiner, outperforms the state-of-the-art baselines in the task of zero-shot relation extraction in web mining.
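
As a rough illustration of the two signals named in this abstract, the sketch below computes a relative path between two nodes of a DOM tree (climb to the lowest common ancestor, then descend) and counts how many pages contain each text string. It is not ReXMiner's implementation: the helper names `ancestors`, `relative_path`, and `text_popularity`, the toy pages, and the use of Python's `xml.etree.ElementTree` on well-formed markup are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def ancestors(node, parent):
    """Return the chain [root, ..., node] using a child -> parent map."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain[::-1]


def relative_path(root, a, b):
    """Tags climbed from `a` up to (but excluding) the lowest common
    ancestor, followed by the tags descended from there to `b`."""
    parent = {child: p for p in root.iter() for child in p}
    pa, pb = ancestors(a, parent), ancestors(b, parent)
    k = 0  # length of the shared prefix of the two root-to-node chains
    while k < min(len(pa), len(pb)) and pa[k] is pb[k]:
        k += 1
    up = [n.tag for n in reversed(pa[k:])]
    down = [n.tag for n in pb[k:]]
    return up, down


def text_popularity(pages):
    """Count, for each text string, the number of pages it appears on."""
    counts = Counter()
    for page in pages:
        texts = {t.strip() for t in ET.fromstring(page).itertext() if t.strip()}
        counts.update(texts)
    return counts


# Hypothetical toy pages sharing one template.
PAGE_A = ("<html><body><div><span>Director</span><a>Jane Doe</a></div>"
          "<p>Cast</p></body></html>")
PAGE_B = ("<html><body><div><span>Director</span><a>John Roe</a></div>"
          "</body></html>")

root = ET.fromstring(PAGE_A)
key, value, other = root.find(".//span"), root.find(".//a"), root.find(".//p")
near = relative_path(root, key, value)   # siblings: short path
far = relative_path(root, key, other)    # different subtree: longer path
popularity = text_popularity([PAGE_A, PAGE_B])
```

In this toy setting a key and its value sit one step apart, while an unrelated node is further away, and a template label such as "Director" appears on more pages than a value such as "Jane Doe".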

data mining, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2305.13805

Country:

North America > Canada (0.05)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining > Web Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)

PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network

Lang, Qiwei, Zhou, Jingbo, Wang, Haoyi, Lyu, Shiqi, Zhang, Rui

arXiv.org Artificial Intelligence, May-9-2023

The number of web pages is growing at an exponential rate, accumulating massive amounts of data on the web. Classifying webpages is one of the key processes in web information mining. Some classical methods are based on manually building features of web pages and training classifiers based on machine learning or deep learning. However, building features manually requires specific domain knowledge, and validating the features usually takes a long time. Considering that webpages are generated by combining text and HTML Document Object Model (DOM) trees, we propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN. It is based on the joint encoding of text and HTML DOM trees in web pages. It performs well on the KI-04 and SWDE datasets and on the practical dataset AHS, built for a scholar homepage crawling project.
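
To make the idea of jointly using text and the DOM tree concrete, here is a minimal sketch that turns a page into the two inputs a graph-based model would typically need: a node list carrying tag and text, and a parent-child edge list. This is an assumed preprocessing step, not the PLM-GNN pipeline itself; the function name `dom_to_graph` and the sample markup are hypothetical, and the parser expects well-formed markup.

```python
import xml.etree.ElementTree as ET


def dom_to_graph(markup):
    """Return (nodes, edges): nodes are (tag, text) pairs indexed by
    position, edges are (parent_index, child_index) pairs."""
    root = ET.fromstring(markup)
    nodes, edges = [], []

    def visit(element, parent_id):
        node_id = len(nodes)
        nodes.append((element.tag, (element.text or "").strip()))
        if parent_id is not None:
            edges.append((parent_id, node_id))
        for child in element:
            visit(child, node_id)

    visit(root, None)
    return nodes, edges


# Hypothetical scholar homepage fragment.
nodes, edges = dom_to_graph(
    "<html><body><h1>Jane Doe</h1><p>Professor</p></body></html>"
)
```

The text of each node could then be embedded by a language model and the edge list passed to a graph neural network; both of those steps are outside this sketch.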

artificial intelligence, machine learning, representation, (18 more...)

arXiv.org Artificial Intelligence

2305.05378

Country:

Asia > China > Jilin Province > Changchun (0.05)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

CoVA: Context-aware Visual Attention for Webpage Information Extraction

Kumar, Anurendra, Morabia, Keval, Wang, Jingjin, Chang, Kevin Chen-Chuan, Schwing, Alexander

arXiv.org Artificial Intelligence, Oct-23-2021

Webpage information extraction (WIE) is an important step in creating knowledge bases. For this, classical WIE methods leverage the Document Object Model (DOM) tree of a website. However, use of the DOM tree poses significant challenges, as context and appearance are encoded in an abstract manner. To address this challenge, we propose to reformulate WIE as a context-aware Webpage Object Detection task. Specifically, we develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree. To study the approach, we collect a new large-scale dataset of e-commerce websites for which we manually annotate every web element with four labels: product price, product title, product image and background. On this dataset, we show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.

extraction, representation, webpage, (14 more...)

arXiv.org Artificial Intelligence

2110.1232

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Illinois (0.05)
Asia (0.04)

Genre: Research Report (0.84)

Industry: Information Technology > Services > e-Commerce Services (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
(3 more...)

Parallel Performance-Energy Predictive Modeling of Browsers: Case Study of Servo

Zambre, Rohit, Bergstrom, Lars, Beni, Laleh Aghababaie, Chandramowlishwaran, Aparna

arXiv.org Machine Learning, Feb-6-2020

Mozilla Research is developing Servo, a parallel web browser engine, to exploit the benefits of parallelism and concurrency in the web rendering pipeline. Parallelization results in improved performance for pinterest.com but not for google.com. This is because the workload of a browser is dependent on the web page it is rendering. In many cases, the overhead of creating, deleting, and coordinating parallel work outweighs any of its benefits. In this paper, we model the relationship between web page primitives and a web browser's parallel performance using supervised learning. We discover a feature space that is representative of the parallelism available in a web page and characterize it using seven key features. Additionally, we consider energy usage trade-offs for different levels of performance improvements using automated labeling algorithms. Such a model allows us to predict the degree of parallelism available in a web page and decide whether or not to render a web page in parallel. This modeling is critical for improving the browser's performance and minimizing its energy usage. We evaluate our model by using Servo's layout stage as a case study. Experiments on a quad-core Intel Ivy Bridge (i7-3615QM) laptop show that we can improve performance and energy usage by up to 94.52% and 46.32% respectively on the 535 web pages considered in this study. Looking forward, we identify opportunities to apply this model to other stages of a browser's architecture as well as other performance- and energy-critical devices.

browser, dom tree, servo, (15 more...)

arXiv.org Machine Learning

doi: 10.1109/HiPC.2016.013

2002.0385

Country:

North America > United States > California > Orange County > Irvine (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Media (0.46)
Information Technology > Services (0.34)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.34)

QWeb: Solving Web Navigation Problems using DQN

#artificialintelligence, Oct-20-2019, 19:54:21 GMT

We first formulate the MDP for our problem. QWeb solves the above problem using a deep Q-network (DQN) to generate Q-values for each state and for each atomic action. The training process is almost the same as for a traditional DQN, with the help of reward augmentation and some curriculum learning approaches, which we will discuss later. But for now, let's first focus on the architecture of QWeb, which is essentially the most fruitful part of this algorithm. Encoding user instructions: As we've seen in the preliminaries, a user instruction consists of a list of fields, i.e., key-value pairs (K, V).

dom element, instruction, instruction field, (16 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Using Machine Learning to Analyze Landing Pages

#artificialintelligence, Apr-3-2019, 23:50:11 GMT

When we started this project, we had hoped that with a semantic element as simple as a headline, human experts would tend to agree on what constitutes a headline. As we've seen above, this turns out not to be the case.

artificial intelligence, machine learning, node, (17 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.60)

Learning to Navigate the Web

Gur, Izzeddin, Rueckert, Ulrich, Faust, Aleksandra, Hakkani-Tur, Dilek

arXiv.org Machine Learning, Dec-21-2018

Learning in environments with large state and action spaces, and sparse rewards, can hinder a Reinforcement Learning (RL) agent's learning through trial-and-error. For instance, following natural language instructions on the Web (such as booking a flight ticket) leads to RL settings where the input vocabulary and the number of actionable elements on a page can grow very large. Even though recent approaches improve the success rate on relatively simple environments with the help of human demonstrations to guide the exploration, they still fail in environments where the set of possible instructions can reach millions. We approach the aforementioned problems from a different perspective and propose guided RL approaches that can generate an unbounded amount of experience for an agent to learn from. Instead of learning from a complicated instruction with a large vocabulary, we decompose it into multiple sub-instructions and schedule a curriculum in which an agent is tasked with a gradually increasing subset of these relatively easier sub-instructions. In addition, when expert demonstrations are not available, we propose a novel meta-learning framework that generates new instruction-following tasks and trains the agent more effectively. We train a DQN, a deep reinforcement learning agent, with the Q-value function approximated by a novel QWeb neural network architecture on these smaller, synthetic instructions. We evaluate the ability of our agent to generalize to new instructions on the World of Bits benchmark, on forms with up to 100 elements, supporting 14 million possible instructions. The QWeb agent outperforms the baseline without using any human demonstration, achieving a 100% success rate on several difficult environments.
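
The curriculum over sub-instructions described in this abstract can be pictured with a small sketch. It assumes an instruction is a mapping of field names to values and that, at curriculum stage k, the agent is responsible for k fields while the rest are treated as already handled. The function `curriculum_task`, the flight-booking fields, and the random selection of fields are illustrative assumptions, not the authors' training code.

```python
import random


def curriculum_task(instruction, stage, rng):
    """Split an instruction into fields the agent must handle at this
    stage (`todo`) and fields assumed to be pre-solved (`done`)."""
    keys = list(instruction)
    rng.shuffle(keys)
    n = min(stage, len(keys))
    todo = {k: instruction[k] for k in keys[:n]}
    done = {k: instruction[k] for k in keys[n:]}
    return todo, done


# Hypothetical flight-booking instruction with three key-value fields.
instruction = {"from": "SFO", "to": "JFK", "date": "2018-12-21"}
rng = random.Random(0)

# The agent's share of the instruction grows by one field per stage.
schedule = [curriculum_task(instruction, stage, rng) for stage in (1, 2, 3)]
```

Each stage presents an easier task than the full instruction, and the final stage recovers the original problem.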

agent, dom element, instruction, (16 more...)

arXiv.org Machine Learning

1812.09195

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Design of Automatically Adaptable Web Wrappers

Ferrara, Emilio, Baumgartner, Robert

arXiv.org Artificial Intelligence, Mar-7-2011

Nowadays, the huge amount of information distributed through the Web motivates the study of techniques for extracting relevant data in an efficient and reliable way. Both academia and enterprises have developed several approaches to Web data extraction, for example using techniques of artificial intelligence or machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision of the information extracted from Web pages and, at the same time, have to prove robust in order not to compromise the quality and reliability of the data themselves. In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different versions of a Web page, in order to handle modifications, avoiding the failure of data extraction tasks and ensuring the reliability of the information extracted. Our purpose is to evaluate the performance, advantages and drawbacks of our novel system of automatic wrapper adaptation.
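
The abstract does not name the similarity algorithm, so the sketch below uses the classic Simple Tree Matching recurrence as a stand-in to show how two versions of a page's DOM tree might be compared. The function names, the normalization by average tree size, and the toy markup are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def simple_tree_matching(a, b):
    """Size of the largest order-preserving, tag-matching mapping
    between the trees rooted at `a` and `b`."""
    if a.tag != b.tag:
        return 0
    ca, cb = list(a), list(b)
    # m[i][j]: best matching between the first i children of a
    # and the first j children of b.
    m = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            m[i][j] = max(
                m[i - 1][j],
                m[i][j - 1],
                m[i - 1][j - 1] + simple_tree_matching(ca[i - 1], cb[j - 1]),
            )
    return m[len(ca)][len(cb)] + 1  # +1 for the matched roots


def similarity(a, b):
    """Matching size divided by the average node count of both trees."""
    size_a = sum(1 for _ in a.iter())
    size_b = sum(1 for _ in b.iter())
    return simple_tree_matching(a, b) / ((size_a + size_b) / 2)


# Hypothetical old and new versions of a page; the new one gains a <span>.
old = ET.fromstring("<html><body><div><h1/><p/></div></body></html>")
new = ET.fromstring("<html><body><div><h1/><span/><p/></div></body></html>")

matched = simple_tree_matching(old, new)
score = similarity(old, new)
```

A wrapper could treat a high score as evidence that the page changed only slightly and try to relocate its target element in the new version, although how the paper's system makes that decision is not stated in the abstract.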

data mining, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

1103.1254

Country:

North America > United States (0.47)
Europe (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
(2 more...)
